Contractions: Breaking the Tokenization-Tagging Circularity
Authors: A.H. Branco and J.R. Silva
Abstract
Ambiguous strings are strings of non-whitespace characters, typically coinciding with orthographic contractions of word forms, that, depending on the specific occurrence, are to be considered as consisting of one token or more than one. Such strings are shown to raise the problem of undesired circularity between tokenization and tagging. This paper presents a strategy to resolve ambiguous strings and dissolve that circularity.

1 Tokenizing-Tagging Circularity

As a starting point, let us take orthographic contractions in the most widely studied natural language. Although in English these items do not exhibit great diversity, the few existing cases illustrate the need to identify several tokens within a single sequence of characters containing no whitespace. For instance, in the Penn Treebank, the sequences I’m and won’t are tokenized as |I|’m| and |wo|n’t|, respectively [2]. If orthographic normalization is sought, for the sake of uniform subsequent syntactic processing of every verb form, then reconstructing the correct orthographic form of the identified tokens is also an issue here, and the strings I’m and won’t should be tokenized as |I|am| and |will|not|, respectively. In Portuguese orthography, there are several instances of orthographic contractions. The most relevant cases concern the contraction of a Preposition with the subsequent word. Some Prepositions contract with Articles and also with Personal and Demonstrative Pronouns: pelo (por o), nele (em ele), desse (de esse). Besides Prepositions, some Clitics, whether in proclisis or not, may also be contracted with other clitics: lho (lhe o). In this connection, the crucial point to note is that some of these strings are ambiguous between a contracted and a non-contracted form (see the list of ambiguous strings in Table 1).
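To make the tokenization task described above concrete, the following sketch (our own illustration, not the authors' implementation; all names in it are assumptions) expands unambiguous contracted forms via a lookup table, with orthographic normalization. Crucially, strings that are ambiguous between a contracted and a non-contracted reading cannot be resolved by such a lookup alone:

```python
# Lookup-based expansion of unambiguous contractions, with
# orthographic normalization (e.g. won't -> will not).
UNAMBIGUOUS = {
    "I'm": ["I", "am"],
    "won't": ["will", "not"],
    "nele": ["em", "ele"],   # Portuguese: Preposition em + Pronoun ele
    "lho": ["lhe", "o"],     # Clitic contraction: lhe + o
}

# Strings whose tokenization depends on the specific occurrence
# (see Table 1): no context-free decision is possible.
AMBIGUOUS = {"pelo", "desse", "desses", "consigo", "mas"}

def tokenize(text):
    """Whitespace-split, then expand known contracted forms."""
    tokens = []
    for chunk in text.split():
        if chunk in AMBIGUOUS:
            # Defer the decision: keep the raw string, to be
            # resolved later by a context-sensitive process.
            tokens.append(chunk)
        else:
            tokens.extend(UNAMBIGUOUS.get(chunk, [chunk]))
    return tokens

print(tokenize("I'm sure"))   # ['I', 'am', 'sure']
print(tokenize("won't go"))   # ['will', 'not', 'go']
print(tokenize("desse"))      # ['desse']  (left unresolved)
```

The deferred ambiguous strings are exactly what gives rise to the circularity discussed next.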
For instance, the string pelo is ambiguous between the first person singular of the “Presente do Indicativo” of the Verb pelar and the contraction of the Preposition por with the Definite Article o (for the full list of ambiguities, see [1]). These items introduce circularity into the standard scheme of tokenization-followed-by-tagging: they should have been previously tagged in order to be correctly tokenized – but to be tagged, they should have already been tokenized. For instance, to decide whether mas should be tokenized as |mas| or as |me|as| in a specific occurrence, it is necessary to know whether, in that occurrence, mas is a Conjunction or the contraction of two Clitics, respectively. This presupposes that tagging has already taken place; but for tagging to have taken place, tokenization was expected to have been applied, and the circle is closed. The fact that tokenization is a low-level task within the spectrum of NLP processing should not lead one to underestimate the importance of this problem. From the point of view of the performance of any NLP task, this is far from being a minor or merely curious issue. On the one hand, these ambiguous strings include potential tokens of functional categories, which are known to be of the utmost importance to constrain the syntactic environment in which they occur and to help guide the subsequent parsing process. On the other hand, these items have a non-negligible frequency, accounting for as much as around 2.1% of a large enough corpus, such as the one used in the experiment reported in the next section. A careless, low-precision approach to this issue, at such an early stage of processing, is very likely to inflict severe and unrecoverable damage on the overall performance of subsequent tasks, in particular on the immediately subsequent task of tagging.
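One way to break the circle sketched above is to resolve each ambiguous string with a dedicated, context-sensitive decision made before full tagging. The following toy sketch is purely illustrative and is not the paper's method: the rule (a clause boundary before mas suggesting the Conjunction reading) and all names are our own assumptions.

```python
def disambiguate_mas(tokens, i):
    """Choose a tokenization for tokens[i] == 'mas' from local context.

    Toy heuristic: the Conjunction 'mas' (but) typically follows a
    clause boundary, while the clitic cluster 'me' + 'as' attaches
    to a verb. A real system would use a trained classifier here.
    """
    if i > 0 and tokens[i - 1].endswith(","):
        return ["mas"]        # one token: Conjunction
    return ["me", "as"]       # two tokens: Clitic + Clitic

print(disambiguate_mas(["cansado,", "mas"], 1))  # ['mas']
print(disambiguate_mas(["deu", "mas"], 1))       # ['me', 'as']
```

With such a per-string disambiguator in place, tokenization no longer has to wait for the tagger, and the tagger receives an unambiguous token stream.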
The importance of an accurate handling of the ambiguous strings can be shown in detail by taking into account their relative frequencies and the relative frequency of each of the two options for their tokenization in a corpus:

Table 1. Distribution of ambiguous strings

String     Freq.   One token   Two tokens
consigo     17        8            9
desse       33        6           27
desses      14        0           14
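From the three rows of Table 1 shown above, the score of a simple majority-class baseline (always picking the more frequent tokenization for each string) can be computed directly from the counts:

```python
# Counts from Table 1: string -> (one-token, two-token) occurrences.
TABLE_1 = {
    "consigo": (8, 9),
    "desse": (6, 27),
    "desses": (0, 14),
}

# Majority-class baseline: pick the more frequent option per string.
correct = sum(max(one, two) for one, two in TABLE_1.values())
total = sum(one + two for one, two in TABLE_1.values())
print(f"majority baseline: {correct}/{total} = {correct / total:.1%}")
# majority baseline: 50/64 = 78.1%
```

Even this crude baseline leaves more than a fifth of the occurrences wrongly tokenized, which motivates a context-sensitive resolution strategy.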